
Conversation

MauriceVanVeen
Member

@MauriceVanVeen commented on Jul 4, 2025

When a follower comes up with an entirely empty disk, it can vote for any server to become leader, which could result in data loss. For example, with an R3 stream: Server A is the leader and has stored all JetStream messages. Server B was part of quorum but hasn't applied all JetStream messages yet. Server C has stored some JetStream messages, but missed the messages that reached quorum on Server A and Server B.

If we shut down Server B and reset its disk, we violate the stable-storage assumption Raft relies on. If Server A now goes down, Server B can't become leader because it's empty (which is good), but it can still vote for Server C to become leader. That would be bad: we'd have lost some writes, and Server A would then be desynced.

By extension, scaling up streams can be dangerous. Scaling an R1 stream up to R3 works most of the time, but there is a very short window where the stream gets assigned to the two other servers and the previous R1 leader dies immediately afterwards. The remaining servers that were just assigned to the R3 stream can then vote for each other to become leader, which loses the full stream contents.

This PR proposes to add protection and special handling when a server's log is empty.

  • When a server's disk or log is fully reset, it will vote using an "empty vote" while its log is empty. This is a special kind of vote that a candidate does not count toward normal quorum. Instead, empty votes can only be used to become leader if the candidate received votes from ALL servers. This protects against data loss even when only a single server still holds the data, as long as all servers can be available at the same time to sort out the reset. (See the vote-tallying sketch after this list.)
  • When an asset is scaling up, the Raft group will have ScaleUp: true set. The servers that are new and part of the scale-up go into observer mode, so they can't become leader, but the "empty vote" check above is relaxed: a candidate can still become leader based on normal quorum and doesn't require votes from all servers.
  • The "empty vote" check is also relaxed when an asset is initialized for the first time. So if you create an R3 stream/consumer, you can still rely solely on quorum and don't need all servers to be available at the same time.
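To illustrate the tallying rule from the first bullet, here is a minimal Go sketch of how a candidate could count normal versus "empty" votes; all type, field, and function names are invented for illustration and are not the actual nats-server code:

// Hypothetical vote tally on the candidate side; names are illustrative,
// not the actual nats-server implementation.
package main

import "fmt"

type voteKind int

const (
	voteNormal voteKind = iota // voter has (some) log state
	voteEmpty                  // voter's log was fully reset / is empty
)

type tally struct {
	clusterSize int
	scaleUp     bool // peer set is being scaled up; new peers are observers
	firstInit   bool // asset was just created, no prior state to lose
	votes       map[string]voteKind
}

// wonElection reports whether the candidate may step up as leader.
func (t *tally) wonElection() bool {
	quorum := t.clusterSize/2 + 1
	normal, total := 0, 0
	for _, k := range t.votes {
		total++
		if k == voteNormal {
			normal++
		}
	}
	// A normal quorum of non-empty votes is always sufficient.
	if normal >= quorum {
		return true
	}
	// During scale-up or first initialization the empty-vote check is
	// relaxed: empty votes count toward plain quorum.
	if t.scaleUp || t.firstInit {
		return total >= quorum
	}
	// Otherwise empty votes only help if ALL servers voted, so a server
	// that might still hold the data is never outvoted in absentia.
	return total == t.clusterSize
}

func main() {
	t := &tally{clusterSize: 3, votes: map[string]voteKind{
		"C": voteNormal, "B": voteEmpty,
	}}
	fmt.Println(t.wonElection()) // false: 2 of 3 votes, but one is empty
	t.votes["A"] = voteNormal
	fmt.Println(t.wonElection()) // true: normal quorum reached
}

The point is that an empty vote can never tip a partial quorum into electing a leader; it only counts once every server has weighed in, or when the scale-up/first-init relaxations apply.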

Like I mentioned before, resetting a disk is technically already a violation of Raft's assumptions, so we could call it a day there. But I feel it's still worth at least attempting to preserve the data, even in such a nightmarish scenario.

TL;DR

  • meta/stream/consumer still require a quorum to store data
  • meta/stream/consumer still require a quorum to be voted as leader
  • only if a disk is fully reset does meta/stream/consumer go into a special mode that uses "empty votes", which ensures data can't be lost as long as at least one server still has the data and ALL servers are temporarily available at the same time to sort out the issue. After that we rely on quorum again, as usual. (A follower-side sketch of this vote follows below.)
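To make the follower side of this concrete, here is a minimal, hypothetical Go sketch of how a freshly reset node might answer a vote request; the type and field names are invented for illustration and are not the actual nats-server API:

// Hypothetical follower-side sketch of casting an "empty vote".
package main

import "fmt"

type node struct {
	term       uint64
	lastIndex  uint64 // highest entry in our log; 0 means the log is empty
	votedFor   string
	recovering bool // restarted and the log was found to be fully reset
}

type voteResponse struct {
	granted bool
	empty   bool // vote must not count toward normal quorum
}

// handleVoteRequest decides how to answer a candidate at candTerm whose
// log ends at candLastIndex.
func (n *node) handleVoteRequest(candidate string, candTerm, candLastIndex uint64) voteResponse {
	if candTerm < n.term || (n.votedFor != "" && n.votedFor != candidate) {
		return voteResponse{granted: false}
	}
	// Freshly reset, empty log: we can't judge whether the candidate's log
	// is up to date, so we only grant a special "empty vote".
	if n.recovering && n.lastIndex == 0 {
		n.votedFor = candidate
		return voteResponse{granted: true, empty: true}
	}
	// Normal Raft rule: only vote for a candidate whose log is at least as
	// up to date as ours.
	if candLastIndex >= n.lastIndex {
		n.votedFor = candidate
		return voteResponse{granted: true}
	}
	return voteResponse{granted: false}
}

func main() {
	n := &node{term: 1, recovering: true} // lastIndex == 0: log was wiped
	fmt.Printf("%+v\n", n.handleVoteRequest("C", 1, 5)) // {granted:true empty:true}
}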

Signed-off-by: Maurice van Veen [email protected]

@MauriceVanVeen force-pushed the maurice/scaleup-desync branch from a1aacee to be29939 on July 7, 2025 14:11
@MauriceVanVeen changed the title from "(2.12) NRG: Empty log & partial catchup protection" to "(2.12) NRG: Empty log protection" on Jul 7, 2025
@MauriceVanVeen force-pushed the maurice/scaleup-desync branch 2 times, most recently from 3247572 to 1a91162 on July 8, 2025 08:24
@MauriceVanVeen marked this pull request as ready for review on July 8, 2025 10:59
@MauriceVanVeen requested a review from a team as a code owner on July 8, 2025 10:59
@MauriceVanVeen force-pushed the maurice/scaleup-desync branch from 1a91162 to cc5f1ff on July 8, 2025 15:20
@MauriceVanVeen
Member Author

Just realized this also makes replicated in-memory streams a lot safer to use (since they don't have any stable-storage guarantees at all).
Before this PR, a 3-node cluster with an R3 memory stream is only guaranteed not to lose data if no more than one server is down at the same time. After this change, as long as one server still holds the data and can replicate it, the other two servers could be restarted and the data would still be safe.

Member

@neilalexander left a comment


Overall LGTM, just one thing.

Log WAL
Track bool
Observer bool
Recovering bool
Member


Can you please add comments explaining the Recovering and ScaleUp side effects here?

Member Author


Done

// Recovering must be set for a Raft group that's recovering after a restart, or if it's 
// first seen after a catchup from another server. If a server recovers with an empty log,
// we know to protect against data loss.
Recovering bool

// ScaleUp identifies the Raft peer set is being scaled up.
// We need to protect against losing state due to the new peers starting with an empty log.
// Therefore, these empty servers can't try to become leader until they at least have _some_ state.
ScaleUp bool
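As a rough illustration of that ScaleUp rule (purely hypothetical names, not the committed code), an empty scale-up peer could simply decline to campaign until it has applied at least some state:

// Hypothetical sketch: a peer added during scale-up stays an observer and
// does not campaign while its log is still empty.
package main

import "fmt"

type peer struct {
	scaleUp   bool   // this peer joined as part of a scale-up
	lastIndex uint64 // 0 means the log is still empty
}

// canCampaign reports whether this peer may try to become leader.
func (p *peer) canCampaign() bool {
	// An empty scale-up peer must first receive _some_ state via catchup.
	return !(p.scaleUp && p.lastIndex == 0)
}

func main() {
	p := &peer{scaleUp: true}
	fmt.Println(p.canCampaign()) // false: still empty
	p.lastIndex = 42             // received state from the leader
	fmt.Println(p.canCampaign()) // true
}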

@MauriceVanVeen force-pushed the maurice/scaleup-desync branch from cc5f1ff to 374d3cc on July 10, 2025 12:53
Member

@neilalexander left a comment


LGTM!

@neilalexander merged commit d72665c into main on Jul 10, 2025
65 of 70 checks passed
@neilalexander deleted the maurice/scaleup-desync branch on July 10, 2025 13:39